Data loading and structure

What the data look like

head(bike, n=4)
##            ride_id rideable_type          started_at            ended_at
## 1 726C3A99FFCAE10C  classic_bike 2022-08-18 18:08:30 2022-08-18 19:00:37
## 2 F1AC3AED5E7498FB  classic_bike 2022-08-11 18:28:21 2022-08-11 18:44:35
## 3 9C93876268A75FD7  classic_bike 2022-08-28 19:40:43 2022-08-28 20:52:50
## 4 45AFFC2B7A7BD7C9  classic_bike 2022-08-15 20:21:00 2022-08-15 20:44:37
##                 start_station_name start_station_id
## 1 Grandview Library at Oakland Ave               79
## 2          Jaeger St & Whittier St               59
## 3           High St & Crestview Rd               88
## 4           High St & Crestview Rd               88
##                   end_station_name end_station_id start_lat start_lng  end_lat
## 1 Grandview Library at Oakland Ave             79  39.98193 -83.04898 39.98193
## 2          Jaeger St & Whittier St             59  39.94460 -82.98950 39.94460
## 3           High St & Crestview Rd             88  40.02252 -83.01364 40.02252
## 4           High St & Crestview Rd             88  40.02252 -83.01364 40.02252
##     end_lng member_casual
## 1 -83.04898        member
## 2 -82.98950        member
## 3 -83.01364        member
## 4 -83.01364        casual

Data preparation

Dealing with NA values

print(paste(sum(is.na(bike)), "number of NA in the data"))
## [1] "2131 number of NA in the data"
sapply(bike, function(y) sum(is.na(y)))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                870                  0               1231 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                 15                 15 
##      member_casual 
##                  0

Most of the NA values are in the start_station_id and end_station_id columns. However, since the station name is present in those rows, I don’t think we should delete the whole row.
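One way to keep those rows and still recover the IDs is to rebuild each missing ID from its station name. The sketch below is only an illustration on a toy data frame: the names and IDs are taken from the head() output above, the NA is placed by hand, and it assumes every station name maps to exactly one ID, which would need checking on the real data.

```r
# Toy rows: names/ids from the head() output; the NA is inserted by hand
toy <- data.frame(
  start_station_name = c("Grandview Library at Oakland Ave",
                         "Jaeger St & Whittier St",
                         "High St & Crestview Rd",
                         "High St & Crestview Rd"),
  start_station_id = c(79, 59, 88, NA),
  stringsAsFactors = FALSE
)

# Build a name -> id lookup from the rows where the id is known
known  <- toy[!is.na(toy$start_station_id), ]
lookup <- setNames(known$start_station_id, known$start_station_name)

# Fill only the missing ids; names without a known id stay NA
missing_id <- is.na(toy$start_station_id)
toy$start_station_id[missing_id] <- unname(lookup[toy$start_station_name[missing_id]])

toy$start_station_id
```

Rows whose station name never appears with an ID would remain NA, so the NA count should be rechecked afterwards.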

Change column types

# convert rideable_type and member_casual to factors
# note: rideable_type has three levels, which needs looking into
bike$rideable_type <- factor(bike$rideable_type)
bike$member_casual <- factor(bike$member_casual)

# split started_at/ ended_at to date column and time column
# change start_date/end_date to date type
bike$start_time <- format(as.POSIXct(bike$started_at), format = "%H:%M:%S") 
bike$end_time <- format(as.POSIXct(bike$ended_at), format = "%H:%M:%S") 
bike$start_date <- as.Date(bike$started_at)
bike$end_date <- as.Date(bike$ended_at)

# Check the structure again
str(bike)
## 'data.frame':    7416 obs. of  17 variables:
##  $ ride_id           : chr  "726C3A99FFCAE10C" "F1AC3AED5E7498FB" "9C93876268A75FD7" "45AFFC2B7A7BD7C9" ...
##  $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 1 1 1 1 3 3 1 1 1 1 ...
##  $ started_at        : chr  "2022-08-18 18:08:30" "2022-08-11 18:28:21" "2022-08-28 19:40:43" "2022-08-15 20:21:00" ...
##  $ ended_at          : chr  "2022-08-18 19:00:37" "2022-08-11 18:44:35" "2022-08-28 20:52:50" "2022-08-15 20:44:37" ...
##  $ start_station_name: chr  "Grandview Library at Oakland Ave" "Jaeger St & Whittier St" "High St & Crestview Rd" "High St & Crestview Rd" ...
##  $ start_station_id  : num  79 59 88 88 88 88 55 55 88 54 ...
##  $ end_station_name  : chr  "Grandview Library at Oakland Ave" "Jaeger St & Whittier St" "High St & Crestview Rd" "High St & Crestview Rd" ...
##  $ end_station_id    : num  79 59 88 88 88 88 48 62 110 54 ...
##  $ start_lat         : num  40 39.9 40 40 40 ...
##  $ start_lng         : num  -83 -83 -83 -83 -83 ...
##  $ end_lat           : num  40 39.9 40 40 40 ...
##  $ end_lng           : num  -83 -83 -83 -83 -83 ...
##  $ member_casual     : Factor w/ 2 levels "casual","member": 2 2 2 1 1 1 1 2 2 1 ...
##  $ start_time        : chr  "18:08:30" "18:28:21" "19:40:43" "20:21:00" ...
##  $ end_time          : chr  "19:00:37" "18:44:35" "20:52:50" "20:44:37" ...
##  $ start_date        : Date, format: "2022-08-18" "2022-08-11" ...
##  $ end_date          : Date, format: "2022-08-18" "2022-08-11" ...
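Note that str() still reports started_at and ended_at as character. If a real date-time type is wanted, for example to derive ride length, something along these lines could work. This is a sketch under assumptions: the tz = "UTC" setting and the duration_min name are illustrative and not part of the original analysis; the timestamps are the first two rows shown earlier.

```r
# First two rides from the head() output
started <- as.POSIXct(c("2022-08-18 18:08:30", "2022-08-11 18:28:21"),
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ended   <- as.POSIXct(c("2022-08-18 19:00:37", "2022-08-11 18:44:35"),
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Ride length in minutes (hypothetical derived column)
duration_min <- as.numeric(difftime(ended, started, units = "mins"))
round(duration_min, 1)
```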

Data analysis & visualization

How many users of each membership type do we have?

bike %>% count(member_casual)
##   member_casual    n
## 1        casual 4069
## 2        member 3347
ggplot(bike, aes(member_casual, fill = member_casual))+
  geom_bar()+
  scale_fill_brewer(palette = "BuPu")+
  guides(fill="none")+
  labs(title = "User membership types", x= "types of membership")+
  theme_classic()

There are more casual users (24-hour or 3-day pass holders) than annual members in August 2022: 4,069 versus 3,347, a difference of about 720. Casual users have two pricing options, a single trip at $2.25 per 30 minutes or $8 for unlimited 30-minute rides in a day, whereas an annual membership costs $85 a year.
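The size of the gap can be checked directly from the counts printed above. A minimal sketch, with the two counts typed in by hand rather than read from bike:

```r
# Counts copied from the count(member_casual) output
rides <- c(casual = 4069, member = 3347)

gap   <- rides[["casual"]] - rides[["member"]]   # 722
share <- round(100 * rides / sum(rides), 1)      # percentage of all rides per group
share
```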

What is the most frequent bike type used?

ggplot(bike, aes(rideable_type, fill = rideable_type))+
  geom_bar()+
  scale_fill_brewer(palette = "BuPu")+
  guides(fill="none")+
  labs(title = "Types of bikes used", x = "")+
  theme_classic()

There are very few docked_bike users compared with the other types. A docked bike is a bicycle that can be borrowed or rented from an automated station, or “docking station”. It would be interesting to find out why people prefer the other types. Given this low usage, we recommend that the company not invest in this type.

Which bike type do casual users prefer, compared with members?

# group by user type and bike type and count rides, excluding docked_bike because it has only 3 users
members_preferance <- bike %>% group_by(member_casual, rideable_type)%>%
  filter(rideable_type != "docked_bike")%>%
  summarise(used = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
print(members_preferance)
## # A tibble: 4 × 3
## # Groups:   member_casual [2]
##   member_casual rideable_type  used
##   <fct>         <fct>         <int>
## 1 casual        classic_bike   1689
## 2 casual        electric_bike  2377
## 3 member        classic_bike   1712
## 4 member        electric_bike  1635
ggplot(members_preferance, aes(x= member_casual,y = used , fill = rideable_type))+
  geom_bar(position='dodge', stat='identity')+
  scale_fill_brewer(palette = "BuPu")+
  labs(title = "Most used bike type by user type", x= "type of user", y="")+
  theme_classic()

While annual members show no large difference between classic and electric bikes, casual users choose electric bikes over classic ones by about 690 (2,377 versus 1,689).

Contingency table of customer type and bike type

# counts for each user type by bike type (these are counts, not probabilities)
round(table(bike$member_casual, bike$rideable_type), 2)
##         
##          classic_bike docked_bike electric_bike
##   casual         1689           3          2377
##   member         1712           0          1635

While annual members choose electric and classic bikes in almost equal numbers, casual users are more likely to choose an electric bike.
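To read the table as "how likely is each user type to pick each bike type", the counts can be turned into row proportions with prop.table(). The sketch below rebuilds the table from the printed counts by hand, so it does not depend on the bike data frame.

```r
# Counts copied from the contingency table above
counts <- matrix(c(1689, 3, 2377,
                   1712, 0, 1635),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("casual", "member"),
                                 c("classic_bike", "docked_bike", "electric_bike")))

# margin = 1 normalises within each row, so each user type sums to 1
row_share <- round(prop.table(counts, margin = 1), 2)
row_share
```

On the real data the same call would be prop.table(table(bike$member_casual, bike$rideable_type), margin = 1).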

On which day of the week is the service used most?

# extract the abbreviated day of the week from the start date
bike$days <- format(bike$start_date, format = "%a")
# convert it to a factor and set the order of the days
bike$days <- factor(bike$days, levels = c("Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri" ))


ggplot(bike, aes(days, fill = days))+
  geom_bar()+
  scale_fill_brewer(palette = "BuPu")+
  guides(fill="none")+
  labs(title = "Number of users in the days of the week", x="Days of the week")+
  theme_classic()

Saturdays and Wednesdays have the highest number of users, but overall there is no big difference in user counts between the days of the week.
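One caveat with format(..., "%a") is that the abbreviations depend on the session locale, so the hard-coded English factor levels could produce NA on a non-English system. A locale-independent sketch, using the weekday number from as.POSIXlt() and the four start dates shown earlier (the day_names vector is an assumption introduced here):

```r
dates <- as.Date(c("2022-08-18", "2022-08-11", "2022-08-28", "2022-08-15"))

# $wday is 0 for Sunday through 6 for Saturday, regardless of locale
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
days <- factor(day_names[as.POSIXlt(dates)$wday + 1],
               levels = c("Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"))
table(days)
```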

What are the busiest and quietest hours of the day?

# extract the hour from the start time (hour() comes from lubridate)
bike$hour <- hour(bike$started_at)
sum_hour <- bike %>%
            group_by(hour) %>%
            summarise(sum_hour = n())

ggplot(sum_hour, aes(hour, sum_hour ))+
  geom_line(color = "#8C6BB1", size = 1) +
  geom_point(color = "#8C96C6", size = 2) +
  scale_x_continuous(breaks=seq(0,23,1))+
  labs(title="Use by hour", y = "")+
  theme_classic()

The peak hours in August are between 3:00 pm and 8:00 pm, within a range of about 200 users.
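The hourly counts and the peak hour can also be computed in base R. The sketch below uses only the four start times from the head() output, so the result is illustrative; tz = "UTC" and the variable names are assumptions. Tabulating over a factor with levels 0 to 23 keeps hours with zero rides in the result, which a grouped count would drop.

```r
started_at <- c("2022-08-18 18:08:30", "2022-08-11 18:28:21",
                "2022-08-28 19:40:43", "2022-08-15 20:21:00")

# Hour of day as an integer 0-23
hours <- as.POSIXlt(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")$hour

# Count rides per hour, keeping empty hours as zeros
per_hour  <- table(factor(hours, levels = 0:23))
peak_hour <- as.integer(names(per_hour)[which.max(per_hour)])
peak_hour
```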

What is the hourly use of each day?

sum_hour <- bike %>%
            group_by(days, hour)%>% summarise(count = n())
## `summarise()` has grouped output by 'days'. You can override using the
## `.groups` argument.
ggplot(data = sum_hour, aes(x = hour, y = count,  color = days))+
  geom_point() + geom_line(aes(group = 1))+
  facet_grid(rows = vars(days))+         
  scale_color_manual(values=c("#BFD3E6", "#9EBCDA" ,"#8C96C6" ,"#8C6BB1", "#88419D", "#810F7C", "#4D004B"))+
  labs(title= "Use by day and hour")+
  scale_y_continuous(breaks=seq(0,130,50))+ 
  scale_x_continuous(breaks=seq(0,23,1))+
    theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank()
  )  

On weekends, the rush starts at around 9:00 am, while on weekdays it starts earlier, at around 6:00 am. Also, on most weekdays usage does not drop until 10:00 pm, whereas on weekends it drops a little earlier, at around 9:00 pm.

Where are the most used stations?

library(mapview)
# keep only rows that have end coordinates
end_station <- subset(bike, !is.na(end_lat) & !is.na(end_lng))

# the coordinates are in decimal degrees, so use WGS84 (EPSG:4326)
# rather than an Ohio state-plane CRS; xcol is longitude, ycol is latitude
mapview(bike, xcol = "start_lng", ycol = "start_lat", crs = 4326, grid = FALSE, label = "start_station_name")
mapview(end_station, xcol = "end_lng", ycol = "end_lat", crs = 4326, grid = FALSE)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(ggplot2)
register_google(key = )

bikemap <-ggmap(get_googlemap(center = c(lon = -82.99879, lat=  39.96118),
                    maptype = 'terrain',
                    color = "color",
                    zoom = 11))
## Source : https://maps.googleapis.com/maps/api/staticmap?center=39.96118,-82.99879&zoom=11&size=640x640&scale=2&maptype=terrain&key=xxx-nSWcLKpCjCjcG2CuuXETG_yg
geom_point(data = end_station, aes(x =end_lng, y = end_lat), size = 10, color = "red")
## mapping: x = ~end_lng, y = ~end_lat 
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
print(bikemap) 

Unfortunately, I could not get the data onto the map. I worked on it for days and even obtained a Google Maps API key, so I am leaving the attempt here in recognition of the effort =)
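The layer description printed after the geom_point() call above hints at a likely cause: a ggplot2 layer only takes effect when it is added to a plot with +, and evaluated on its own it just prints itself. The sketch below shows the pattern with a plain ggplot() object standing in for the ggmap() base map, so no API key is needed. The three coordinates come from the head() output; base_plot and with_points are names introduced here.

```r
library(ggplot2)

# Three end stations from the head() output
stations <- data.frame(end_lng = c(-83.04898, -82.98950, -83.01364),
                       end_lat = c(39.98193, 39.94460, 40.02252))

# Stand-in for the ggmap(get_googlemap(...)) base map
base_plot <- ggplot()

# The layer must be added with `+`; calling geom_point() alone draws nothing
with_points <- base_plot +
  geom_point(data = stations, aes(x = end_lng, y = end_lat),
             size = 3, color = "red")

length(base_plot$layers)    # no layers yet
length(with_points$layers)  # the point layer is attached
```

With the real map this would presumably read bikemap + geom_point(data = end_station, aes(x = end_lng, y = end_lat), ...), followed by printing the result.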

# Load the modified data with pricing and distance
pric <- read.csv("202207-cogo-tripdata.csv")
price <- subset(pric,select=c(member_casual, Pricing, Distance, rideable_type))

How much do casual members spend?

casual_price <- subset(price,member_casual == "casual")
round(summary(casual_price$Pricing),2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.25    2.25    2.40    2.79    2.40    6.75       4

The median spend of casual members is $2.40 (the mean is $2.79), and the maximum is $6.75.

ggplot(casual_price,aes(Pricing))+
  geom_histogram(fill = "#8C6BB1", alpha=.5, bins = 30, na.rm = T)+
  labs(title = "Casual customer payments")+
  theme_classic()

The histogram is right-skewed and there are outliers.

Investigate the outliers of the price

What type of bikes do they use?

ggplot(casual_price, aes(rideable_type, Pricing))+
  geom_boxplot(outlier.colour = "#810F7C", na.rm =T)+
  scale_fill_brewer(palette = "BuPu")+
  guides(fill="none")+
  expand_limits(y = 2)+
  labs(title = "Price outliers by bike type", x = "")+
  theme_classic()

Comparing the boxes of the two types, classic_bike has a wider spread, from about $2.50 to above $3, while for electric_bike the first through third quartiles all sit around $2.50. There are more outliers for electric_bike than for classic_bike, but the difference is not large.

After deleting the outliers

narm_pr<-outlierKD2(casual_price, Pricing, histogram = T)
## Outliers identified: 1054 
## Proportion (%) of outliers: 27.3 
## Mean of the outliers: 4.44 
## Mean without removing outliers: 2.79 
## Mean if we remove outliers: 2.34 
## Nothing changed
nprice <- nrow(price)
nrmprice <- nrow(narm_pr)

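The row count goes from 7790 in price to 4918 in narm_pr, a drop of 2872 rows, and the histograms show how the shape changes without the outliers. Note, however, that outlierKD2() printed "Nothing changed", which suggests the outliers were only identified and not removed, so this drop most likely reflects the subset to casual users rather than deleted outliers.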
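For reference, outliers of this kind are usually flagged with the boxplot rule (values more than 1.5 times the hinge spread beyond the hinges). The sketch below applies that rule with base R's boxplot.stats() to a small made-up vector that reuses price points from the summary above. Whether outlierKD2() uses exactly this rule internally is an assumption, and the vector is illustrative, not the real data.

```r
# Illustrative prices only (values borrowed from the summary output)
pricing <- c(2.25, 2.25, 2.25, 2.40, 2.40, 2.40, 2.40, 6.75, NA)

# Values flagged by the 1.5 x IQR boxplot rule (NAs are ignored)
out <- boxplot.stats(pricing)$out

# Blank out the flagged values instead of dropping rows
cleaned <- pricing
cleaned[cleaned %in% out] <- NA

mean(pricing, na.rm = TRUE)  # mean with the outlier
mean(cleaned, na.rm = TRUE)  # mean without it
```

Setting outliers to NA keeps the row count unchanged, so nrow() before and after would not show how many values were removed; sum(is.na(cleaned)) does.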

# helper: return the most frequent value (the mode) of a vector
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
payment <- narm_pr%>% 
           group_by(rideable_type)%>%
           summarise(sum_price = round(sum(Pricing),2),
           mean_price = round(mean(Pricing),2),
           mode_price = getmode(Pricing))

xkabledply(payment)
Model: rideable_type ~ sum_price + mean_price + mode_price

| rideable_type | sum_price | mean_price | mode_price |
|---------------|-----------|------------|------------|
| classic_bike  | 6047.4    | 2.82       | 2.25       |
| docked_bike   | NA        | NA         | NA         |
| electric_bike | 7675.5    | 2.77       | 2.40       |

The average price paid by electric_bike users ($2.77) is nearly the same as for classic_bike users ($2.82). However, the mode differs: the most common payment for electric_bike is $2.40, which is $0.15 more than the $2.25 for classic_bike. Total revenue from electric_bike is higher than from classic_bike, which is understandable because it has more users.
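The NA row for docked_bike probably appears because sum() and mean() return NA as soon as a group contains a missing price. A minimal base-R sketch of the same per-type summary that skips missing values is shown below; the toy data frame is made up for illustration and the tapply() approach is a stand-in for the dplyr pipeline, not the original method.

```r
# Same mode helper as in the analysis
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Made-up prices, including one missing value
toy <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "classic_bike", "classic_bike",
                    "electric_bike", "electric_bike", "electric_bike"),
  Pricing = c(2.25, 2.25, 3.00, NA, 2.40, 2.40, 2.55),
  stringsAsFactors = FALSE
)

# Per-type summaries that ignore missing prices
sum_price  <- tapply(toy$Pricing, toy$rideable_type, sum,  na.rm = TRUE)
mean_price <- tapply(toy$Pricing, toy$rideable_type, mean, na.rm = TRUE)
mode_price <- tapply(toy$Pricing, toy$rideable_type,
                     function(p) getmode(p[!is.na(p)]))

round(cbind(sum_price, mean_price, mode_price), 2)
```

In the dplyr version the equivalent change would be passing na.rm = TRUE to sum() and mean() inside summarise().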

Is there a correlation between time and day of usage?

#glm(hour ~ days, data = bike, family = )
**What is the average trip distance?**

**Probability of user membership and distance time**

```r
# y1 = continuous, y2 = binary
#t.test(dis_time ~ member_casual, var.equal = FALSE)